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Abstract 

The extraction of robust parton distribution functions with faithful errors requires a careful 
treatment of the uncertainties in the experimental results. In particular, the data sets used 
in current analyses each have a different overall multiplicative normalization uncertainty 
that needs to be properly accounted for in the fitting procedure. Here we consider the 
generic problem of performing a global fit to many independent data sets each with a 
different overall multiplicative normalization uncertainty. We show that the methods in 
common use to treat multiplicative uncertainties lead to systematic biases. We develop a 
method which is unbiased, based on a self-consistent iterative procedure. We then apply 
our generic method to the determination of parton distribution functions with the NNPDF 
methodology, which uses a Monte Carlo method for uncertainty estimation. 
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1 Introduction 

The interpretation of forthcoming experiments at the Large Hadron Collider requires the 
development of precision statistical analysis tools. One context where this is especially 
apparent is the determination of the parton distribution functions of the proton, which are 
obtained by global analysis of existing data sets (see pQ for a review). These PDFs, together 
with their associated uncertainties, should be without theoretical prejudice, and should be 
associated with a genuine statistical confidence level. In a series of recent papers [2H7] 
the NNPDF collaboration has adopted a method based on a Monte Carlo estimate of 
uncertainties [8] that allows one to propagate the uncertainty on the experimental data 
to the fitted PDFs, and to any other quantity which depends on them. The PDFs are 
parametrized by neural networks, containing large numbers of free parameters, sufficient to 
ensure that the resulting ensembles of fitted PDFs are free from any bias due to assumptions 
about the underlying functional form. Within this context, in which one aims typically at 
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accuracies at the percent level in physical observables, more subtle problems in data analysis 
become relevant. One such problem is the treatment of normalization uncertainties. 

When combining data sets from independent experiments it is necessary to take account 
of the overall normalization uncertainty associated with each experiment: an experiment 
with large normalization uncertainty should contribute less to the fit than one with a small 
uncertainty. Normalization uncertainties are usually multiplicative, in the sense that each 
data point within the set has a normalization uncertainty proportional to the measurement 
at that point. All these normalization uncertainties are however correlated across the whole 
set of data points. Fitting the data with the usual Hessian method, using the complete 
covariance matrix, leads to a substantial bias in the fitted value due to the fact that smaller 
data points are assigned a smaller uncertainty than larger ones [9] . This problem is usually 
avoided by using the "penalty trick": the normalization of each data set is treated as a 
free parameter to be determined during the fitting, within a range restricted by the quoted 
experimental uncertainty. While this method indeed gives correct results when fitting data 
from a single experiment, we will show below that it remains biased when used in fits which 
combine several different data sets. 

In this paper we develop a treatment of normalization uncertainties which is always free 
from bias, even when fitting to many different data sets. While the original motivation of 
this study comes from PDF determination, all our results are of general relevance whenever 
a quantity has to be determined from data which are affected by multiplicative uncertainties. 
Hence most of our discussion will be completely general, and we will only consider the issue 
of PDF determination in the end, as an example of the application of our proposed method. 

The paper is constructed as follows: in Sect. 2 we review the Hessian and Monte Carlo 
methods for the extraction of a quantity from a set of experimental measurements, and how 
they might be implemented when there are normalization uncertainties. In Sect. 3 we review 
the well-known fact that using the full covariance matrix for the treatment of normalization 
uncertainties leads to a bias for a single experiment, and we show that further biases arise 
when combining several experiments. In Sect. 4 we show that the so-called "penalty trick" 
method commonly used to overcome this bias, while working well for a single experiment, 
still leads to biased results when used to combine results from several experiments. In 
Sect. 5 we attempt to construct a self-consistent covariance matrix, but find that this too 
leads to biased results when used to combine several different experiments. In Sect. 6 we 
cure the defect in this method by determining the self-consistent covariance matrix through 
an iterative procedure using as a starting point the result of a previous fit: we show that this 
method (the "io-method" ) is completely unbiased and rapidly convergent. Finally in Sect. 7 
we discuss the treatment of normalization errors in PDF fitting: after a brief summary of 
the methods used currently, we discuss the effect of our new £o~ me thod on the NNPDF1.2 
parton fit. Specifically, we demonstrate explicitly the practicality and convergence of the 
method, and use it to quantify the effect of including the normalization uncertainties in the 
determination of PDFs. 

The language of the paper is that of a particle physicist (and thus similar in approach 
to e.g. Ref. [9]), not a professional statistician. We thus make no pretence of mathematical 
rigour, but hope nonetheless that our results will be of practical use to physicists interested 
in fitting large datasets. 
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2 Hessian and Monte Carlo 



2.1 Hessian Methods 

Consider a simple but typical experimental situation in which we have n measurements 
raj of a single theoretical quantity t, with experimental uncertainties given by a covariance 
matrix (cov)jj-: typically this takes the form 

n 

(cov)jj = dijfjf + ^2 a ik a kj , (1) 
k=i 

where the o", are uncorrelated uncertainties (typically obtained as the sum in quadrature 
of statistical and uncorrelated systematic uncertainties) and uik are correlated (typically 
systematic) uncertainties. All experimental uncertainties are generally assumed to be Gaus- 
sian. Then the least squares estimate for t is given by minimizing the \ 2 function 

n 

X 2 (0 = (* - rrH)(cav-%(t - rrij), (2) 



'■j= 



and thus by 

E^iCcov- 1 )^- 



The variance of t, Vu is found through 

V.-f^A' 1 - 1 (4) 

Vtt -y*iw) -E^-=i(oov-%-- (4) 

Of course when there are several quantities t to be determined, the variances are given 
by the diagonal elements Vu of the matrix determined by inversion of the Hessian matrix 
of second derivatives of x 2 '- hence the name of the method. Here the simplest case of only 
one quantity t will be sufficient to illustrate the points we wish to make. The situation is 
yet more complicated when fitting parton distribution functions: then, t is actually some 
nontrivial but calculable function (such as a cross section or structure function) of the 
(many) fitted parameters which describe the shape of the underlying parton distribution 
functions which is being determined. Again this complication is irrelevant to the issues to 
be discussed here, so we will ignore it. 

In the very simplest case in which the measurements mj have completely uncorrelated 
(e.g. purely statistical) uncertainties Oj, 

(cov)ij = (covo)ij = of<%, (5) 

so 

X\t) = ±^l, (6) 
i=l ai 

whence at the minimum t = w and Vjt = S 2 , where 

n 

i=i * 
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and S 2 is given by 

1 n 1 

i=i 1 

Thus the theoretical quantity t is given by the average value of the measurements rrii 
weighted by the inverses of the variances af , and the inverse of the variance of t is likewise 
the average of the inverses of the variances af . 

When all the variances are equal, u\ = a, and w = m, where 

1 " 

m=-V]mi, (9) 

i=l 

i.e. the unweighted average of the measurements. In this limiting case, the determination 
Eq. ([3]) of t is manifestly seen to be unbiased (and in particular it tends to the true value in 
the limit of large number of measurements): no measurement is preferred, and X 2 = cr 2 /n, 
so the variance is reduced by a factor of 1/ra, due to there being n independent measurements 
of the same quantity. In what follows we will use the words "biased" and "unbiased" to 
describe estimates of the mean which pass or fail this simple test. 

In the opposite extreme, if one of the variances, say af, becomes very large compared 
to the others, the contribution of the measurement rrii to w and X 2 becomes very small, so 
this measurement decouples from the rest as it must. In what follows we shall use freedom 
from bias (as defined above) and decoupling as two criteria to assess the usefulness of a 
particular method of determining t and its variance^] 



2.2 The Monte Carlo Method 

A different way of determining a theoretical quantity from a set of measurements is to 
construct a Monte Carlo representation of the data. First, the n data points are associated 
to n random variables Mi, normally distributed around the averages rrii according to the 
covariance matrix (cov)y. Then, an ensemble {Mi} of replicas of the data is constructed: 
these by construction satisfy 

(Mi) = rrii, (MiMj) = m imj + (cov)^-, (10) 

where () denotes averaging over the set of replicas. Finally an ensemble {T} of replicas 
of the theoretical quantity t is determined from the data replicas {Mi}, by minimizing a 
suitable error function Emc(T)- The mean value and variance of t (and indeed any other 
function of t) may then be found simply by averaging over the replicas: 

E[t] = (T), Var[i] = (T 2 ) - (T) 2 . (11) 

This method becomes advantageous when the determination of some function of t is 
called for: once the ensemble of replicas {T} has been found, error propagation to any 
function of t, no matter how complicated, may be performed by simply averaging over 

1 Note that the terms "bias" and "decoupling" as used here are not quite the same as the technical 
definitions of consistency and bias used by statisticians, which distinguish more carefully between results 
obtained with finite size samples and those when the sample size becomes infinite. 
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replicas. This is especially useful in situations in which t is multidimensional, or a nontrivial 
function of some underlying theoretical quantity (such as a PDF). 

Clearly, the features of the result obtained with this method depend on the choice of 
error function Emc(T) which determines the the ensemble of replicas {T} from the data 
replicas {M{\. In the simple situation discussed in the previous section we may choose 
Emc(T) = x 2 {T), where the x 2 function is given by Eq. (|2|). Then 

EJViCcov- 1 );^ 
Ei J= i(cov % 



so that, using Eq. (fTT]) . 
while 

Var[t] = 



y^-ifcov )amj 

m = , _ - , (13) 



\ 2 

E" j= i(cov-%j 

(14) 



in agreement with Eq. ([3]) and Eq. @ found using the Hessian method. 



2.3 Normalization Uncertainties 

The two methods discussed in the previous two sections cover most situations of uncorre- 
cted and correlated errors found in combining experimental data. However problems arise 
when data sets have overall multiplicative uncertainties, such as normalization uncertain- 
ties: this is due to biases arising from the rescaling of errors [91-111]. The effect of these 
biases in the naive application of the Hessian method can be very severe, as will be dis- 
cussed in the following section. Here we consider normalization uncertainties in the Monte 
Carlo method, which is possibly more straightforward. 

First consider the situation in which all the data come from a single experiment, with 
a single overall normalization uncertainty s, assumed to be Gaussian. In the Monte Carlo 
method the normalization uncertainty is taken into account by multiplying the data by a 
random factor N, which is normally distributed around 1 with variance s. Assuming iV to 
be uncorrelated with Mi 

(N) = 1, (N 2 ) = l + s 2 , {f(N)g{Mi)) = (f (N)) (g(Mi)) . (15) 

When fitting to the replicas, the error function will depend on NM{, but the weights of the 
different measurements will be unchanged, since an overall rescaling of the data should not 
affect the relative weight of the measurements in the fit. Thus we now take 

n 

E MC (T) =J2(T~ NMiXcov- 1 )^ - NMj) (16) 



5 



where (cov)jj is the same covariance matrix used when there was no normalization uncer- 
tainty. The result of the minimization of Eq. (|16p is the same as Eq. (|12p . but with an 
overall factor of N 

T = N Z 1^- 1 \ Z \ (17) 

so E[t] is the same as in Eq. (|13p . while now 

E- Jlfc ,/=i(cov- 1 ) ii (cov- 1 ) M ((iV 2 )(M J M,) - <JV> a <Af J -><Af,» 



Var[t] 



(a=i(c°v-%' a 
+ s 2 E[t} 2 . (18) 



1 + s 2 



Er ? =i(cov r 



The first term on the right-hand side of Eq. (|18p is the same as was found previously in 
Eq. (|14p . but with an extra factor of 1 + s 2 , while the second term is the contribution to 
the variance from the normalization uncertainty. This is as expected: indeed, the variance 
of a product of random variables is 

Var[iVAf] = E[N] 2 Vav[M] + E[M] 2 Var[iV] + Var[iV]Var[M], (19) 

where the last term is usually neglected because, being a product of two variances, it 
corresponds to a higher order moment of the probability distribution for NM. 
For the simple case of uncorrelated statistical measurement errors Eq. (J5]), 

Euc(T ^±^M£, (20) 



whence 

T = iVS2 E9> (21) 



Mi 



i=i a i 

so E[i] = w, Eq. (HD, while 

Var[t] = £ 2 + sV + s 2 X 2 . (22) 

Clearly then this method gives correct unbiased results for the case of a single experiment 
with normalization uncertainty. 

Let us now consider a slightly more complex situation, where each of the measure- 
ments rrii comes from a different experiment, and also has an independent normalization 
uncertainty Sj. Here and henceforth when discussing this situation we will neglect pos- 
sible correlations between these measurements with independent uncertainties, which are 
usually absent if the measurements are obtained from independent experiments; however, 
the inclusion of such correlations is straightforward and it does not affect our subsequent 
results. Following the same line of reasoning as above, we want to derive the fitted value for 
t and its variance in the Monte Carlo approach. We thus introduce independent normally 
distributed random variables iVj to represent the normalization uncertainties, each with 
mean one, variance sf, and uncorrelated to each other and to the Mf. 

(Ni) = 1, (NiNj) = 1 + sf%, (f(Ni)g(Mj)) = (f(Ni))(g(Mj)). (23) 
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The difficulty now is to choose an appropriate error function. The simplest choice would 

*MC(T)^£<™>!. (24) 
f— 1 erf 

Minimizing with respect to T and averaging over the replicas then gives E[t] = w as before, 
but 

n 

Var[t] = T 1 A Y j {{N i N j ){M i M j )-{N i ){N j ){M i ){M j ))/a 2 i al 

n 

= £ 2 + £ 4 £ S 2 (m 2 + a 2 )Ax 4 . (25) 
i=i 

When all the experiments have the same normalization uncertainty, Si = s, these results 
are not so unreasonable: E[i] = w, while 

n 2 

Var[i] = (1 + s 2 )£ 2 + s 2 S 4 £ ^ . (26) 

i=i ai 

However Eq. (|24p is clearly incorrect in general because differences in the normalization 
uncertainties Si are not taken account of in the weighting of the different measurements. In 
particular if one of the experiments has a relatively large normalization uncertainty, it still 
contributes to the mean, but spoils the measurement by giving a very large contribution to 
the variance. 

Therefore, results found using the Monte Carlo method with the error function Eq. (I24j) 
are unbiased when the normalization uncertainties are equal, but do not satisfy the criterion 
of decoupling. It follows that when we have more than one experiment, and in particular 
when we wish to include experiments with a large overall normalization uncertainty, we 
need to choose a better error function than Eq. (|24p . which incorporates differences in the 
normalization uncertainties Sj. We are thus led to consider error functions built using the 
full covariance matrix, including normalization uncertainties, and thus rather closer to the 
X 2 -function. 



3 The d'Agostini Bias 

We saw in the previous section that when we are combining different experiments with 
independent and different normalization uncertainties, it is necessary to incorporate these 
differences into the x 2 -f unc tion or error function used in the fitting procedure. This is true 
both for the Hessian method and for the Monte Carlo method. In the previous section we 
have shown that this is easily done in the case of a single experiment, but when several 
experiments must be combined the simplest choice of error function Eq. ()24|) leads to results 
which do not satisfy the decoupling criterion. In this section we will see that in the Hessian 
case the simplest choice of error function leads to results which are severely biased even in 
the case of a single experiment. 

Specifically, we consider the case in which normalization uncertainties are included 
by using as an error function the x 2 -fu n ction computed using the full covariance matrix 
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including normalizations: 

n 

xl(t) = (t - rmXcov-^it - m 3 ). (27) 

As in the previous section we consider two cases separately: when we have n measurements 
all made within a single experiment, and thus with a common normalization uncertainty, 
so 

(cov m )ij = (cov)ij + s 2 mimj, (28) 

and then when each of the n measurements is made in an independent experiment, all 
uncorrelated, and in particular with different normalization uncertainties, so 

(cov m )jj = (of + sfmj)Sij. (29) 

Note that the more realistic case in which there are n eX p experiments each with n^ at mea- 
surements can be built from these two simpler examples by first combining the many mea- 
surements in each individual experiment together into one measurement, using Eq. (|28l) . 
and then combining the experiments using Eq. (129j) . so these two cases should suffice to 
illustrate all the issues involved. 



3.1 One experiment 

Consider first a very simple model of a single experiment with only two data points. The 
covariance matrix Eq. (|28|) is then simply 

i \ _ ( °i + s2rn i s 2 mim 2 \ ( s 

(COV m jy - ^ s 2 mim2 a 2 + g 2 m 2 J W 



so the x Eq. ([27]) is 

2 , v _ (t — m\) 2 (a\ + mis 2 ) + (t — m 2 ) 2 {a\ + m\s 2 ) - 2(t - mi)(t - m 2 )m 1 m 2 s 2 
Xm[t) ~ a 2 a 2 + (mja 2 + mJ^J? ' 

(31) 

Minimizing this x 2 -f unc ti° n with respect to t gives after a straightforward calculation the 
result 

, = mi/oj + m 2 /al = w 

l/cr2 + l/ f7 2 + ( mi _ m2 )2 s 2/ (T 2 fT 2 l + ( mi _ m2 )2 s 2/ S 2' ^ > 

where w is the weighted mean Eq. (J7J) with n = 2. 

It follows that when mi ^ m 2 and s / the result for t has a downward shift. That this 
shift is clearly a bias can be seen for instance by considering the simple case o\ = a 2 = o . 
Then Eq. fl32]) gives 

* = t ■ 9 2^2-2/ 2 = ™(! " 2r 2 s 2 m 2 /a 2 + 0(r 4 )). (33) 
1 + 2r z s z m z /(j z 

where we have defined 

- ~\ / \ mi — 777-2 /„ ,\ 

777 =^(7771 -1- 7772), T= . (34) 

A 7771 + 7772 
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Thus simply minimizing the x 2 derived from the correlated covariance matrix Eq. (j'28|) leads 
to a central value which is shifted downwards: for a sufficiently large s 2 /a 2 one can get an 
average which is lower than either of the two values which are being averaged, so one must 
conclude that the result is biased. 

This bias gets worse as the number of data points increases [9]I11]IT2]. For n data nii, 
with statistical uncertainties Uj, the covariance matrix given by Eq. (j28[) and Eq. ([5]) has 
inverse 

( - ^ - mim i g2 (*k\ 

[COV m }ij - l + s2m2/s2 • m 



where 



m 2 = s 2^^|_ (36) 

The x 2 -function Eq. (j27|) is then minimized when 



where w is defined in Eq. (J7J) while r is defined through 

m 2 - w 2 = £ 2 £ (mt ~ Wj = rW. (38) 

So again we have a downwards bias unless s 2 = or all the measurements m, are equal. 
Note further that when the data are consistent and n is large, r 2 is simply given by the 
variance £ 2 of the measurements: r 2 w 2 ~ nS 2 . The bias is thus by a factor 1/(1 + ns 2 ), 
which will become arbitrarily large as the number of data points increases. 

The origin of the bias is clear: smaller values of nii have a smaller normalization un- 
certainty rriiS, and are thus preferred in the fit. Several examples of situations where this 
leads to absurd results may be found in Ref. [9]. The variance of t is afflicted by the same 
downward bias: 

_ E 2 + ^ 2 (l + r 2 ) 
1 + r z s z w z 

We will henceforth refer to this as the "d'Agostini bias", after Ref. [9j[IT] where it was 
studied and explained. 

3.2 More than one experiment 

In the second example with n distinct experiments the d'Agostini bias is much milder. With 
the covariance matrix Eq. (|29p . the x 2 -function is 

x^) = E ( 2T m f 2 > (40) 

of + mfsf 



which is minimized when 



2->i=l a 'f+s 2 m'f 
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To exhibit the bias, consider again the case when all statistical uncertainties are equal, 
cjj = a: then on expanding in powers of 

i n I - \1 -i n 2 -2 

^lV (m '- m) = lym i -m 
n ^— ' m 2 n ^-^ m 2 

i=l i=l 

with fh given by Eq. ([9]) we find 

t = m(l-2r 2 f™\_ 2 + 0(r 4 )V (43) 

The origin of the bias is the same as in Eqs. (|33|37p . and indeed for two data points the biases 
are approximately the same: however for n independent experiments, the bias Eq. (|43p is 
stable for large n. In this case the variance is given by (again expanding in r) 

V tt = ±(a 2 + s 2 fh 2 (l + r 2 )) + 0(r i ), (44) 

which is as expected, since m 2 (l + r 2 ) = ^ Y^l=i m ?- 



4 The Penalty Trick 

The standard [51111(112] way to include normalization uncertainties in the Hessian approach 
while avoiding the d'Agostini bias consists of including the normalizations of the data n{ 
as parameters in the fit, with penalty terms to fix their estimated value close to one with 
variance sj. In this section we will discuss this widely used method, which we will refer to 
as the "penalty trick": we will see that while it gives correct results for a single experiment, 
when used to combine results from several experiments it is actually still biased. 



4.1 One Experiment 

We first consider a single experiment, with covariance matrix Eq. ©, but now with an 
overall normalization uncertainty with variance s 2 . The value of t is then obtained by 
minimizing the error function 

^ / % (tin - rrii) 2 (n-1) 2 . 

^Hess (t , n = V ^ + . (45) 

r-f of s 2 

i=i 1 

where the last term is called the penalty term. The parameters t and n are then determined 
by minimizing this error function: minimizing with respect to t gives t = nw, with w as 
defined in Eq. ((7|), while minimization with respect to n fixes n = 1, so for the central 
values the result is the same as in the Monte Carlo approach^ 



2 Note that to obtain an unbiased result from this approach, the factor n must rescale the theory, not 
the data: if instead of Eq. (|45[) we took 

(t - nrrii) 2 (n - l) 2 



^ \j. - arm) 

^Hess - Tfl + 



i=l « 

we would get a result with a strong downward bias similar to that in Eq. (|37[) [9llll| . 
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In order to compute the error on the fitted quantity in this approach, we need to evaluate 
the Hessian matrix: 

\dndt V / 1 / 

The covariance matrix is obtained by inverting V . In this way one recovers 

V tt = S 2 + s 2 w 2 . (47) 

This is the same result as Eq. ()22[) obtained with the Monte Carlo approach, apart from 
the cross-correlation term between the variances of the measurements and the variances of 
the normalization, akin to the last term in Eq. (|19|) . 

For the case of a single experiment, we have thus recovered within the Hessian approach 
the Monte Carlo result of Sect. 12.31 in this simple case the Hessian and Monte Carlo 
methods are (almost) equivalent. It is straightforward to generalize this equivalence to a 
general covariance matrix Eq. (pQ). 



4.2 More than one experiment 

Let us now turn to the more complex situation where we have several data points from 
different experiments, and thus with independent normalizations, = 1 ± Sj. In this case, 
the Monte Carlo result Eq. ([25]) is unbiased (in the sense that it gives an unbiased average 
over the data when all uncertainties are equal), but it does not satisfy the requirement of 
decoupling. We will now show that the penalty trick leads to a result which does satisfy 
decoupling, but is biased when all uncertainties are equal. 

Following the same line of reasoning as for the single experiment above, we set up the 
error function 



t? n \ (t/m - mj) 2 ^ (n f - l) 2 

Eu css {t, rii) = 2^ 2 h Z> 2 ' ( 48 ) 



=1 °i i: s i 



where now we have a separate penalty term for each of the normalizations to be fitted. The 
minimum is obtained for 



En rrij 
»=1 ?li(T 2 

En 1 



(49) 



s 2 t ( t \ 
m = 1 + -^-g mi) . (50) 

These n + 1 equations are now complicated nonlinear relations which must be solved for t 
and 7ij. A general analytic solution is probably impossible, and it seems very likely that for 
a large number of experiments the number of solutions will grow rapidly, making it difficult 
to select the correct one. However it is possible to find solutions for certain special cases, 
which are sufficient to show that the approach is biased. 

First we specialize to the case of only two experiments: to explore the bias we again 
assume that a\ = 02 = <r and s\ = S2 = s, as in Sect. 3.1. Adding and subtracting the two 
equations for n\ and n2, and substituting the equation for t, we find 

n i + n 2 = n i + n 2 > n\ — n\ = —2Aniri2 , (51) 
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c(x) 




Figure 1: The "bias" functions c(x) for the central value (left plot) and v(x) for the variance (right 
plot), as defined in Eqs. (|57ti58[) . corresponding to the results obtained when the d'Agostini bias is 
present, Eqs. (|43ti44|) . using the penalty trick, Eqs. (|55H56[) . and using the self-consistent covariance 
matrix method, Eqs. (|75M76[) . The unbiased result corresponds to c = v = 0. 



where 



s 2 



A = ^ Mm\-m\). (52) 

a 2 + m 1 m 2 s 2 2 

Solving these for ri\ and n 2 , and choosing the solution close to one (note that A will 
generally be rather small) gives 

= K 1 + ^tIj)' (53) 



where fa and r are defined in Eq. 1^4 

The variance may now be computed as before by inversion of the three by three Hessian 
matrix. The result is not very enlightening, however it simplifies if we expand in powers of 
r: the Hessian method then gives 



,2^2 ^2 _2s 2 m 2 ' 



V tt = \ a 2 + (l + r 2 - 4 W + 0(r 4 ) ) ■ (56) 



Clearly, both the central value and variance Eqs. (I55|56p are biased: the normalization 
uncertainties leads to data being weighed differently based on their central values, even 
when their normalization uncertainties are the same. It is not difficult to show from the 
structure of Eqs. (|49|50p that the results Eqs. (|55)56p remain true for any number of data 
points, i.e. for all n > 2, provided that rh and r are defined as in Eqs. (|9|38p . and an overall 
factor of 2/n is included in the variance. 

This bias is more subtle than the d'Agostini bias Eq. P3j) . in that it is caused by 
nonlinearities in the error function rather than by a consistent bias in the variances. It 
can thus have either sign, depending on the relative weight of statistical and normalization 
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Figure 2: Dependence of the central value t on the ratio of normalization uncertainties S\ and S2 
for a pair of measurements with central values mi = 0.9, r>i2 = 1.1 and negligible uncertainties 
cTj. The curves correspond to the cases of the penalty trick method Eq. to the self-consistent 
covariance matrix method Eq. ([78]). and to the t method Eq. (|98|). The unbiased result must be 
symmetric about the point L = (i.e. si = S2): only the to curve is unbiased. Decoupling of the 
data point with the larger error is clearly visible in the graph for L — > ±00. 



uncertainties: we can rewrite Eqs. (I55H56P in the form 



t = m(l + c(^)r 2 + 0(r 4 )), (57) 

Vtt = ^ 2 + ± S 2 m 2 (l + r 2 + *;(^)r 2 + 0(r% (58) 

with the functions c and v given by 

x - 2 . , (2x - l) 2 

c ( x ) = 1 TTo> v(x) = (59) 

v ; (x + 1) 2 x ' (x + l) 3 v ; 

The unbiased results would correspond to c = v = 0. The d'Agostinh-biased results 
Eqs. (|43M4p can also be cast in the form of Eq. (!57H58p . but now with c(x) = —2/{x + 1), 
v(x) = (since in this case the variance is unbiased at 0(r 2 )). 

The "bias" functions c(x) and v(x) for these two cases, as well as for a further biased 
case to be discussed below in Sect. [SJ are compared in Fig. [TJ Thanks to the penalty trick, 
the bias in the central value is generally less severe, but the variance is now also biased. 

It is also interesting to explore decoupling in this approach. To do this we again consider 
two experiments, in the special case in which the normalization errors dominate, so a 2 , cr 2 <C 
s 2 m 2 ,s 2 m 2 . Then Eqs. (|49|50p can again be solved: n« = t/rrii, with 

m 1 sj + m 2 sl 

t = ra\rri2 — 9-^5 5-5. (60) 

mfsf + 771^2 

The variance is now 

s 2 s 2 

V tt = m\m\ no 1 , 2 22 - ( 61 ) 
mfsf + 77I2S2 

When si <C S2, we recover t = mi and Vu = 777 2 s 2 , as we should. However the result 
Eq. (|60p looks rather strange: one would expect that in this limit the measurements mi 
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and 1x12 to be simply weighted by 1/s 2 and in analogy to the weighting Eq. (J7J) 

in the single experiment case, and the variance to be a similar weighted average of mf 
and m\ (see Eqs. (|98H99|) below). Instead the weighting is more complicated, reflecting 
the bias in this method for general s\ ^ s%- Note for example that when s± = S2, t = 
mim2(mi + m2)/{m\ + m|) / |(mi + m2), and Vu = s 2 m\m\/(m\ J rm 2 l ) / \s 2 {m 2 + m 2 ). 
The result Eq. (|60p is shown in Fig. [2] for the special case m\ = 0.9 and m2 = 1.1. The 
bias is readily apparent in the asymmetry of the curve. 

The generalization to n independent experiments is straightforward in this limit: Eq. (|60j) 
becomes 

En 1 

* = V n T . (62) 



so if Sj = s, when there should be no bias, we have instead t = Y27=i m i 1 / S?=i m i 2 7^ m 
unless rrii = m. 



5 A Self— Consistent Covariance Matrix 

We saw in Sect.[3]that minimizing the x 2 Eq. (|27p constructed using a covariance matrix of 
the form Eq. (I28p gives the so-called d'Agostini bias. This bias comes from the dependence 



of the normalization term in the covariance matrix on mi and mf indeed the bias is 
proportional to the differences rrn — rrij. A possible way out, alternative to the penalty 
trick of the previous section and based on a covariance matrix approach, was suggested by 
d'Agostini in Ref. |llj . Namely, one could choose to use 

(cov t )ij = (cov)jj + s 2 t 2 , (63) 

since t is, by construction, a more precise estimator of the observable than m,j, and has 
already averaged out the differences in central value of the different measurements. We 
will now show that this method leads to results which are similar to those found using the 
penalty trick: for one experiment there is no bias, but for several experiments a bias arises. 
There is also a problem with multiple solutions. 

5.1 One experiment 

With two measurements within a single experiment, we now have a covariance matrix 

<fX"th={ \2 t 2 a 2 2+s 2 t 2 ) (64) 

so the x 2 is 

2 _ {t - mrfja 2 + t 2 s 2 ) + {t- m 2 ) 2 (a 2 + s 2 t 2 ) - 2(t - mQft - m 2 )s 2 t 2 

It is easy to check that minimizing this \ 2 with respect to t gives again t = w, where w 
is the weighted average Eq. ([7]). The variance Vu is now simply the inverse of the second 
derivative of the x 2 Eq. f)31 1) at the minimum: a straightforward but tedious calculation 
then leads back to Eq. (|4T|) . 
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It is not difficult to show that everything works out for a general covariance matrix 
Eq. (|63p for a single experiment with n data: 



idxt 



dt 



,7=1 ij=i V Ul 'v 



n n / dcov \ 

^2 (cov t^ijit-mj) - \ ^2 - m i)( cov r 1 )ifc ( J ( cov r%'(* ~ m i) 

j=i i,i,fc,i=i V /m 



n n 



i,j=l k,l=l 

which vanishes when 

t = , -i ■ (67) 

9 2 v 2 

The variance of t is given by evaluating -^f at the minimum: this yields 

V tt = 1 — . (68) 

E?j=i(covt h 

When 

(covt)ij = ^jCJj 2 + s 2 t 2 , (69) 

the inverse is 

, -|. S t 5j , . 

Kfe^-^^. ( 7 °) 



where £ is as defined in Eq. (|8|) . Thus Eq. (I67p and Eq. (168j) simplify to the familiar results 
t = w Eq. © and F« = £ 2 + s 2 u; 2 Eq. g7}. 

Note however that since xf Eq. (|3ip is no longer quadratic in t, there is also a spurious 
solution: the term in square brackets in Eq. (|66p vanishes when t = T, 2 /ws 2 which might 
be troublesome since it may lie close to the correct solution t = w whenever s 2 w 2 ~ X 2 . 

5.2 More than one experiment 

Consider now the case of n independent experiments. The covariance matrix is then 

(cov t )y = (a 2 + sfi 2 )<% , (71) 

so the x 2 is 

*?M = (72) 
i=i 4 + * 

The minimum of x 2 is found by solving the system of nonlinear equations 

(t 2 s 2 + ^ 2 ) 2 
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In general these equations will have An — 2 solutions, so for a large number of experiments 
finding the correct solution might be difficult. 

To explore the bias we consider as usual the symmetric situation cij = a, s, = s, as we 
did in Sect. 14.21 Then t is the (upper) solution to the quadratic equation 

(ms 2 t + a 2 )(t - m) = r 2 m 2 s 2 t, (74) 

so again when r is small we have a solution close to t = fh: 

t = m(l+ J^'l 2 r 2 + 0(r A )) . (75) 
\ cj z + s z m z ) 

Thus as might be expected from the shape of the x 2 the minimum is pushed upwards for 
mi 7^ m 2- The variance is also biased: 

Vu = \o 2 + \s 2 fn 2 (l + r 2 + J^r 2 + 0(r 4 )) . (76) 

The biases can again be cast in the form of Eq. flSTJ) and Eq. (j58[) but now with 

c(x) = — — , v(x) = ——. (77) 
x + 1 a; + 1 

They are compared to those of the penalty trick method in Fig. [TJ 

Another interesting special case is when the normalization errors are dominant, so (for 
n experiments) s^m^ 3> Cj. The solution for t then reduces to 



El 



(78) 



So when one of the Sj is very large, this experiment decouples as expected. However the 
weighting is still biased: when = s, t = Yl.i 171 ! I 'J2i m i ^ unless all = m. The 
result Eq. (|78|) is also plotted in Fig. [2 it is instructive to compare it with the superficially 
similar result Eq. (|62p obtained with the penalty trick, which was also biased, but in the 
opposite direction. 

Thus the self-consistent covariance matrix discussed here is biased when used for more 
than one experiment, like the penalty trick method of Sect. HI Since these methods are 
biased in the Hessian approach, they would also fail to provide a suitably unbiased error 
function for fitting to Monte Carlo replicas. Moreover these methods have multiple solu- 
tions, which also make them difficult to implement in a Monte Carlo, since if some of the 
replicas are fitted to the wrong solution, these will clearly also lead to an incorrect final re- 
sult. In the next section, we will present a new method which is free of multiple solutions, 
is unbiased when uncertainties are equal, but which unlike the method of Sect. 12.31 also 
correctly weights the different experiments according to their normalization uncertainties 
when these are unequal. 
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6 Unbiased Fitting 



The biases and multiple solutions found in the previous section come from the fact that the 
X 2 (t) function used in the fitting is no longer a quadratic function of the observable t being 
fitted, and thus the distribution of exp(— ^Xt (*)) i s no longer Gaussian. The dependence 
of the covariances of the data on t 2 distorts the shape of the x 2 > an d thus introduces a 
bias. The only way to avoid this is to hold the covariance matrix fixed when performing 
the fitting. We can do this by evaluating the covariance matrix using some fixed value to 
rather than t. The value of to can then be tuned independently to be consistent with the 
value of t obtained from the fit. The basic idea is then to determine to self-consistently 
in an iterative way. The use of theoretical estimates to avoid d'Agostini bias through an 
iterative procedure can also be found in the treatment of multiplicative systematic errors 
by the HI Collaboration [13] . 

The necessity for such a procedure is particularly clear in the Monte Carlo approach. 
When we fit to replicas, we do so on the assumption that each replica is generated with a 
given distribution of the data central values according to their experimental uncertainties. 
Clearly this uncertainty should be fixed once and for all, not vary from replica to replica. 
We will now show that this procedure indeed gives unbiased results for equal normalization 
uncertainties and that it also satisfies the decoupling criterion, for both the Hessian and 
Monte Carlo methods. Finally we also show that the iterative determination of to converges 
very rapidly, and thus that the method is also practical. 



6.1 One experiment 

For a single experiment, in place of Eq. (|63|) the covariance matrix is chosen to be 

(cov to )y = (cov)ij + tgs 2 , (79) 

where to should be viewed as a guess for t, to be fixed beforehand. In a Hessian approach 
the \ 2 is then 

n 

xl(t) = J2 (* - "wXcoOvC* - m ;)' ( 8 °) 

and minimization is trivial: 

= 'EiJ=l( calv t 1 )i3 m 3 

E" i= i(cov i0 , tJ 

while 



u > (81) 



V tt = — . (82) 

Eij = l(cOV t0 )ij 

For the special case in which the data have uncorrelated statistical errors only, Eq. ([5]) , using 
the inversion Eq. (|70p (with t replaced by to) we thus recover t = w Eq. (JT]), independent 
of the value chosen for to, while 

V tt = X 2 + s 2 t 2 . (83) 



This reduces to Eq. (|47l) . but only if we tune to = t. 
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In a Monte Carlo approach the x f° r each replica is 



xl(T) = E ( T " NM i)( c °n 1 )ij(T - NMj), (84) 
and minimization is again trivial: 

f ~ — / =n • C°5J 

Li,j=i(cov to )y 

The expectation value and variance of t can now be straightforwardly evaluated: 



E™j=i( co O" 



E[t] = ^: 1 ^ t °!?'\ (86) 




Var[t] = (1 + ^)| 1 -s 2 ^ ] + s 2 E[t] 2 . (87) 

This coincides with the Hessian result Eqs. (|81ti82p when to = E[t]. Also it reduces again 
to the old results E[t] = w Eq. © and Var[t] = (1 + s 2 )S 2 + u> 2 Eq. ((22J) for uncorrected 
statistical errors Eq. ©. These results are then quite independent of the value of to, 
explaining the success of the old error function Eq. (|16p . Note however that the value of 
the x 2 Eq. ([SU]) at the minimum for each replica will still depend on to, and will indeed 
be quite different from that of the error function Eq. (I16j) . to which it reduces only when 
t = 0. 

6.2 More than one experiment 

When we have n independent experiments (each with one data point), we now choose in 
place of Eq. flU} 

(cov t )y = (a* + s 2 i t 2 )5 ij , (88) 

so the x 2 is now 

X 2 = X2^£, (89) 

whence we have the single solution 







i ' 






l 




i 







and 

For the special case Sj = s, a\ = a, these reduce to t = fh and Vu = ^(f 2 + s2 ^o) as 
they should: the fit is unbiased, and the variance is correctly estimated provided only that 
to = t. When instead the normalization uncertainties dominate, 

, - £±1?, y u = -J^, ( 92 , 



En LL r-\n 

i=l 7? 2^i=l 
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and thus the central value exhibits decoupling (if some Sj is much larger than the others, 
t becomes independent of the corresponding measurement mj), and the variance is again 
correctly estimated whenever to = t. 

In the Monte Carlo approach the x 2 f° r a given replica is now 

i= i °i + s i l 

Minimization is again straightforward: 



En I _ 

i=1 vf+sfti 



v^n 1 



(94) 



whence on averaging over replicas 



En 



E[t] = ^ n "'7'*° , (95) 
Var[t] = — 1 1 ' ° J 2 . (96) 



The central value coincides with the Hessian result Eq. (|90p . while the variance is now 
determined more accurately: Eq. ([96|) reduces to Eq. (f9Tj) whenever mf + a 2 ~ t^. 

In the special cases considered previously, in the symmetric case Sj = s, Oi = a we have 

E[t]=m, Var[i] = ±(<r 2 (l + s 2 ) + s 2 m 2 (l + r 2 )), (97) 

which is indeed unbiased. Note that unlike in the Hessian method, the variance cross 
term is now properly included (see Eq. (119p ). the spread of values rrtj now contributes to 
the variance, as it should (see Eq. flU}), an d both the expectation value and variance are 
actually independent of to, just as they were for the case of a single experiment. 
When the normalization uncertainties dominate, of <C sjt 2 , 



E[t] = ^r^r, (98) 

Var[t] = - t=1? (99) 

v^n 1 1 

The result Eq. (f9"gj) is compared in Fig.[2]to the biased results Eq. (fTSj) and Eq. ([Blip obtained 
previously. It is manifestly unbiased. Furthermore, when the normalization uncertainty of 
one of the experiments is particularly large, this experiment now decouples from both the 
mean and the variance just as it should. Note again that the results Eq. (|98p and Eq. (|99p 
are entirely independent of to : in the Monte Carlo method to only controls the relative 
balance between the statistical and normalization errors, and then only when these are 
different amongst themselves. 
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6.3 Determining t 



The results of Sect. 16. 1116.21 imply that the to-covariance matrix Eq. (f79|) gives a x 2 function 
that can be used to give unbiased fits to the individual replicas, with to controlling the 
relative balance between statistical and normalization errors, both in the Hessian and Monte 
Carlo method. The remaining difficulty with this approach is that to is n °t determined self- 
consistently within the minimization, but rather must be fixed beforehand. Clearly if the 
value chosen is incorrect, this may itself lead to an incorrect fit. Indeed, we have seen that 
the Monte Carlo and Hessian results with the to-method lead to the same central prediction 
only when to = Eft]. 

However, the dependence on to is rather weak. That this is the case is qualitatively 
clear: firstly to only determines the uncertainties, so an error we make in to is a second 
order effect; furthermore all dependence on to cancels when all the rrii are equal, when <jj 
and Si are equal, or when normalization errors dominate over statistical, or indeed vice 
versa. To make this more precise, consider a small shift to — > to + 5tQ. The corresponding 
shift in Eft] Eq. (|95l) (or its Hessian counterpart Eq. (|90l) is then given by 



As expected this vanishes when all the mi are equal, in the symmetric case Sj = s, Oi = a, 
and in the limits when either statistical or normalization uncertainties dominate. Elsewhere, 
we expect it to be very small. 

To quantify this, we first note that when combining a large number of experiments, so 
that n in Eq. f| 1 00 [) is large, the result is essentially independent of n (since both numerator 
and denominator grow as n 2 ), so a typical contribution to the sums may be taken as 
indicative of the overall result. Now for any pair of measurements, 2to(mj — rrij) ~ mf — 
m 2 will typically be of the same size as the uncertainties in the measurements, so if all 
percentage uncertainties are of the same order A, i.e. Sj ~ A, Oi ~ Ato, then Eq. (|100l) 
gives <5E[t] ~ A 2 (5to- So since A is always rather less than one, we always expect 5E[t] <C <5to- 

It follows that to can be determined iteratively: a first determination of t is performed 
with a zeroth-order guess for to, such as, for example, to = 0, which in the Monte Carlo 
approach corresponds to the simple choice Eq. (|24p in which normalization uncertainties 
are not included in the error function. The result for E[t] thus obtained is used as to for a 
second iteration, and so on. Since <5E[t] ~ A 2 <5to, and A <C 1, it is clear that these iterations 
will converge rapidly to a result within the envelope of the purely statistical uncertainties 
on E[t]. Moreover, since as we saw in the previous section Eft] is unbiased for any to, the 
final result (which has to = E[t]) will also be unbiased. Note incidentally that this also 
implies that the to-method must give a result different in general to that obtained with the 
penalty trick, since the latter is biased (see Fig. 2). 

The estimate Eq. (jlOOp then suggests that already the first iteration should be sufficient 
to provide a determination with a relative uncertainty of A 2 for data affected by typical 
relative uncertainties A, i.e. the procedure is expected to converge at the first iteration for 
all practical purposes. 

In table [1] we summarise the features of our new to-method compared to those of all 
the other procedures discussed in this paper, based on the two criteria of freedom from bias 



E 



{m i -m j )(s^cr^-cr^s 2 j ) 



5E[t] = 6t 




2 



(100) 
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method | E or x z 


Bias for = s. cr^ = a 






1 expt, n data 


n expts 


2 expts 


MC with t = 


CDS, 121} 


1 


1 


t> (mi + TO2) 


Hessian with d'Agostini bias 


pili. poll 


l 

l + ns 2 


i 2s 2 a 2 , 

^ ■> i ■> - ■> T~ ■ ■ • 


l/misf+l/m 2 S2 


Hessian with penalty trick 


@5j,lg8} 


1 


s 2 (<r ;,! -2s 2 m ;i ) 


l/mi -fl/mj s 2 


t-cov. mat. 


(65}, CE} 


1 


S 2 (T 2 

1 + <7 2 +sW +••• 


m l/ s l+ m 2/ s 2 


to-cov. mat. 


J84ll . (1931 


1 


1 


m l/ s l+ m 2/ s 2 



Table 1: Summary of the various methods for the inclusion of normalization uncertainties. The sec- 
ond column provides the reference to the pair of E- or ^-functions whose minimization determines 
the results for one experiment and many experiments in each case. Results for the bias (defined in 
the text) are given assuming the data asymmetry r 2 m a 2 /in 2 . 

when uncertainties are all equal, and decoupling when the normalization uncertainty of one 
experiment is very large, as presented in the end of Sect. 12. 11 Specifically, we consider the 
Monte Carlo method with normalization uncertainties not included in the error function 
discussed in Sect. 12.3] the d'Agostini-biased Hessian method of Sect.[3j the Hessian method 
with penalty trick presented in Sect. |4j the self-consistent covariance matrix method (t- 
cov) discussed in Sect. [5l and finally the io^rnethod discussed in this section. The two 
central columns in the table give the ratio of the central value to the unbiased results 
Eq. (I86p and Eq. (I95p in the one-experiment and n-experiment cases respectively, thereby 
exposing the bias whenever the ratio is not equal to one. The result is given for uncorrelated 
systematic uncertainties, all equal to a, and normalization uncertainties all equal to s, and 
assuming that the difference between measurements r Eq. (|42[) is r 2 ~ a 2 /fh 2 . The last 
column gives the central value for a fit to two experiments, in the limit where normalization 
uncertainties are dominant and other uncertainties can be neglected, in order to show how 
(or if) an experiment decouples when its normalization uncertainty becomes very large. 

6.4 Likelihood 

So far in this paper we have taken a naive approach, in which we constructed least squares 
estimators, and minimised them with respect to the theoretical prediction t. It is interesting 
to consider instead how one might construct a likelihood function for the measurements and 
normalizations, and thus whether any of our estimators are maximum likelihood estimators. 

Consider for definiteness the case of several experiments, with measurements mi and 
variances Cj. In presenting the measurements in this form, there is an underlying assump- 
tion that the measurements are Gaussian. The likelihood is then simply defined as the 
probability that the measurements take the observed values given a certain theoretical 
value t for such measurements: 

P{m\t) = iVexp ( - | £ ^1L\ , (101) 

i=l a i 

where N is some overall normalisation factor for the probability, dependent on <7j but not 
on rrii or t. Of course this probability is just TV exp(— ^x 2 (t)), so the maximum likelihood 
estimator is found by minimising the x 2 i i- e - it is the same as the least squares estimator. 
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Now consider what happens when there are also normalization uncertainties rij with 
variances Sj. Again in the absence of further information it is natural to assume these 
are also Gaussian. Furthermore they are clearly entirely independent of the measurement 
uncertainties, since the physics involved in determining the normalization is generally quite 
independent of that related to the measurements rrii or indeed the theoretical value t. Thus 
the total likelihood should factorise: P(m, n\t) = P{m\t)P(n). The maximum likelihood 
estimators obtained from P(m,n\t) and P(m\t) should thus be the same. 

Now the Hessian method of Sec. [3l by adopting the x 2 -function Eq.([4"0"|). assumes for 
the likelihood 



This is incorrect, because it is no longer Gaussian in rrii, and indeed not even properly 
normalised (to normalise it, N must depend on t, and then maximising P m (m\t) is no 
longer the same as minimising the \ 2 )- Thus the x 2 -function Eq. (l40p is not a maximum 
likelihood estimator, principally because the probability distribution it assumes is skewed 
by the normalization uncertainties. 

Similarly the Hessian method with penalty trick presented in Sec. HI by adopting the 
error function Eq. (|48p . assumes for the likelihood 



This is also incorrect, because now the likelihood, while Gaussian in rrii, is n °t Gaussian 
in rii, and furthermore cannot be factorised into a product P(m\t)P(n). Once again it is 
not properly normalised: ./V must depend on t. So this too does not give us a maximum 
likelihood estimator. The main problem here is that since the model for the likelihood 
does not factorise, the assumption of a common theoretical result t introduces artificial 
correlations between the normalization measurements rif, this leads to biases since these 
measurements are in principle completely independent. This is why the penalty trick, 
while giving correct results for a single experiment with only one overall normalization 
uncertainty, fails when applied to several independent experiments. 

The self-consistent covariance matrix method (i-cov) discussed in Sec. [5l with x 2 ~ 
function Eq. (172ft . assumes the likelihood to be 



This is better, because it is now Gaussian in rrii. However the normalization N still depends 
on t, and thus the maximum likelihood estimator no longer minimises the x 2 > but has a 
correction from In N(t). It may thus be possible to construct the maximum likelihood 
estimator by this route, but it is difficult because everything is so nonlinear in t. 

Finally consider the to _me thod discussed in this section, which takes as its starting 
point the x 2 -f unc tion Eq. (|89|) . Here the assumption for the likelihood is 




(102) 




(103) 




(104) 




(105) 
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This is now correct: it is Gaussian in mj, properly normalised, and all we have to do to 
make sure we get the correct answer is choose to consistently. Note that it can alternatively 
be formulated as 

P t0 {m,n\t) = Nexpl , 2 ,,2 C 2 + 12 )• ( 106 ) 

i=l a i S i S i J J 

This is rather like the penalty trick Eq. (|103p . but with the rij in the first term replaced 
with its best estimate no, just as in going from m-cov Eq. (|102p to to _cov Eq. (1105p we 
replace m; in the denominator with our best estimate to- Note of course that actually 
riQ = 1: this is the natural choice for presenting the data that all experimentalists choose. 
Eq. (|106p is also a good definition of the likelihood: it is Gaussian in both mi and nj, the 
normalization is determined quite independently of t, and it factorises correctly into the 
product Pt (m\t)P(n). Thus when we minimise wrt t, Pt (m\t) and Pt (m,n\t) give the 
same maximum likelihood estimator for t, as they should. 

Since the to _me thod yields the maximum likelihood estimator, it possesses all the nice 
asymptotic properties of maximum likelihood estimators: in particular it is consistent and 
unbiased, and asymptotically efficient (see for example Ref. |14j). 

7 Application to PDF determination 

As mentioned in the introduction, in PDF determinations an accurate result is sought while 
combining many different sources of uncertainty for a large number of experiments. Typical 
uncertainties in a global parton fit are summarized in Table [2] for deep-inelastic scattering 
(DIS) experiments and Table [3] for hadronic experiments. Specifically, we list in Tables [2] 
and [3] the uncertainties of the DIS data included in the NNPDF1.2 [7] analysis, and those of 
the Drell-Yan, weak boson production and jet data included in the NNPDF2.0 [15] global 
fit. The total number of data points included in these sets is about 3000 for the NNPDF1.2 
analysis, and about 3500 for NNPDF2.0. The datasets used of other recent global parton 
fits |161I17] are similar. For DIS data, all uncertainties including normalizations are of the 
order of a few percent, while for hadronic data uncertainties vary widely and can be as large 
as 20%. We expect therefore the impact of a full treatment of normalization uncertainties 
to be moderate when fitting DIS data, and more dramatic in truly global fits which include 
hadronic data. 

7.1 Impact of normalization uncertainties 

Normalization uncertainties have been included in recent PDF determinations either by 
using the penalty trick with the Hessian method in recent MSTW [16] and CTEQ |17pi8] 
fits, or using the Monte Carlo method with the to = error function Eq. (|24p in recent 
NNPDF fitfl pSE]. 

As we have seen, neither of these procedures is entirely satisfactory. However, the bias 
in the penalty trick method used by MSTW and CTEQ, with uncertainties such as those 

3 In fact, the error functions used by NNPDF differs from Eq. (|24[) by a factor of Nf in the denominator, 
ft is not difficult to see that the only effect of this extra factor is to introduce a small downward bias of 
relative order sf (so ~ 0.01% for the data in NNPDFL2) in E[t] and Var[t]. 
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DIS data (NNPDF1.2 [7] data set) 


Experiment 


Set 


STAT (%) 


SYS (%) 


NORM (%) 


TOT (%) 


NMC-pd 














NMC-pd 


2.0 


0.4 


0.0 


2.1 


NMC 














NMC 


3.7 


2.3 


2.0 


5.0 


SLAC 














SLACp 


2.7 


0.0 


2.2 


3.6 




SLACd 


2.5 


0.0 


1.8 


3.1 


BCDMS 














BCDMSp 


3.2 


2.0 


3.2 


5.5 




BCDMSd 


4.5 


2.3 


3.2 


6.6 


ZEUS 














Z971owQ2 


2.5 


3.0 


2.1 


4.7 




Z97NC 


6.2 


3.1 


2.1 


7.8 




Z97CC 


33.6 


5.6 


2.0 


34.2 




Z02NC 


12.2 


2.1 


1.8 


12.7 




Z02CC 


38.6 


6.2 


1.8 


39.3 




Z03NC 


6.9 


3.2 


2.0 


8.3 




Z03CC 


29.1 


5.6 


2.0 


29.8 


HI 














H197mb 


2.8 


2.0 


1.7 


3.9 




H1971wQ2 


2.7 


2.5 


1.7 


4.2 




H197NC 


12.5 


3.2 


1.5 


13.3 




H197CC 


27.5 


4.6 


1.5 


28.1 




H199NC 


14.7 


2.8 


1.8 


15.2 




H199CC 


25.5 


3.8 


1.8 


25.9 




H199NChy 


7.2 


1.7 


1.8 


7.7 




H100NC 


9.4 


3.2 


1.5 


10.4 




H100CC 


20.4 


3.8 


1.5 


20.9 


CHORUS 














CHORUSnu 


4.2 


6.4 


7.9 


11.2 




CHORUSnb 


13.8 


7.8 


8.7 


18.7 


FLH108 














FLH108 


47.2 


53.3 


5.0 


71.9 


NTVDMN 














NTVnuDMN 


16.2 


0.0 


2.1 


16.3 




NTVnbDMN 


26.6 


0.0 


2.1 


26.7 


ZEUS-H2 














Z06NC 


3.8 


3.7 


2.6 


6.4 




Z06CC 


25.5 


14.3 


2.6 


31.9 



Table 2: Different sources of uncertainty for data included in the NNPDF1.2 [7] analysis, which is 
representative of the deep-inelastic scattering data included in typical parton fits. All uncertainties 
are given as a percentages, obtained as averages over all points of the percentage values in the corre- 
sponding set; "stat" denotes all uncorrelated uncertainties (mostly statistical but also uncorrelated 
systematics) ; "sys" uncertainties which are correlated between all experiments in a set; "norm" 
multiplicative (normalization) uncertainties which are also correlated between all experiments in a 
set; and "tot" the average of the sum in quadrature of all uncertainties. Normalization uncertainties 
are fully correlated between the NMCp and NMCd datasets and cancel in the NMC-pd ratio. 
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Hadronic data (NNPDF2.0 15 data set) 


Experiment 


Set 


STAT (%) 


SYS (%) 


NORM (%) 


TOT (%) 


DYE605 














DYE605 


16.6 


0.0 


15.0 


22.6 


DYE886 














DYE886p 


20.4 


0.0 


6.5 


22.1 




DYE886d 


18.3 


0.0 


6.5 


20.6 




D Y.h886r 


3.6 


1.0 


0.0 


3.8 


CDFWASY 














CDFWASY 


4.2 


4.2 


0.0 


6.0 


CDFZRAP 














CDFZRAP 


5.1 


6.0 


6.0 


11.5 


DOZRAP 














DOZRAP 


7.6 


0.0 


6.1 


10.2 


CDFR2KT 














CDFR2KT 


4.5 


21.1 


5.8 


23.0 


D0R2CON 














D0R2CON 


4.4 


14.3 


6.1 


16.8 



Table 3: Same as Table [2] but for the hadronic data which is included, in addition to the DIS data 
of Table [2j in the NNPDF2.0 [15] analysis. Again, this dataset is representative of the hadronic 
data included in typical parton fits. Normalization uncertainties are fully correlated between the 
DYE886p and DYE886d datasets and cancel in the DYE866r ratio. 

of Tables [2] and [3] is likely to be at or below the percent level for DIS data, though it could 
be non-negligible for some hadronic (in particular Drell-Yan) data. Similarly, the error in 
using the Monte Carlo method with to = can be estimated using Eq. (jlOOp which for DIS 
experiments gives 5E[t] ~ 0.01 St . Hence, the NNPDF1.0 and NNPDF1.2 fits performed 
in Ref. [BE] to DIS data are expected to be within a few percent of the true result. The 
deviation is more significant for hadronic data. 

7.2 Implementation of the t method 

In order to test our general arguments, and also as a practical illustration of the new to 
method we have implemented it in the NNPDF framework and repeated the NNPDF1.2 
fit [7] with the to method. As a starting point we take the old NNPDF1.2 fit, and thus 
to = 0. We then performed several iterations in each one taking the central value of the 
previous fit to determine to, and thus the covariance matrix Eq. ()T9[) : specifically itel thus 
takes to to be the central value of NNPDF1.2 (referred to hereafter as iteO), ite2 takes to to 
be the central value of the itel fit, and so on. In this way we can assess the convergence of the 
method. Three iterations proves to be more than sufficient. At each iteration we produce 
one thousand replicas: the uncertainty on to is then the overall pdf uncertainty divided by 
VTOOO, which is sufficiently small that random fluctuations are kept under control. 

In Fig. [3] we show the results for the central pdfs (i.e. to) in the various iterations, iteO, 
itel, ite2 and ite3 normalized with respect to ite3. We show, besides the overall one-sigma 
PDF uncertainty of ite3 (the very broad band in the plots), also the expected range for 
the fluctuations of the to iterations in the convergence regime (the relatively narrow band) . 
As expected, convergence is reached very quickly, essentially at the first iteration: while 
the original central value (iteO) often lies some distance from the central narrow band, all 
subsequent iterations lie more or less within it. Note however that even iteO lies essentially 
within the broad PDF uncertainty band: the shift in central values is generally within one 
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Figure 3: Comparison between NNPDF1.2 PDFs and the results of the three first ^"iterations. In 
each case the PDFs shown have been used to compute the to covariance matrix for the next iteration. 
Each iterations has been computed with 1000 Monte Carlo replicas. We show the one-sigma PDF 
uncertainty band for ite3, as well as the same band divided by ylOOO, which corresponds to the 
expected statistical fluctuations of the PDF central values (hatched area). For the singlet and the 
gluon PDFs plots displaying both the small x and large x regions are shown. 
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itel 


ite2 


ite3 




9.2 


2.6 


2.8 


g{x,QV) 


2.9 


1.9 


2.1 


T 3 (x,QZ) 


7.5 


0.9 


0.7 


V(x,QZ) 


1.1 


0.8 


2.8 


A s (x,Qt) 


14.4 


1.6 


1.2 


s+(x,Qo) 


2.8 


3.5 


3.3 


8-(x,Qo) 


1.9 


1.2 


2.6 



Table 4: The stability distances for various PDFs for the various iterations of the NNPDF1.2 to 
fits, all computed from 1000 replicas. For each fit, distances are computed with respect to the fit 
from the previous column (itel being with respect to NNPDF1.2). The distances are averaged over 
the data regions, as defined for each PDF in [7]. 



Experiment 


iteO 


itel 


ite2 


ite3 


Total 


1.32 


1.25 


1.25 


1.25 


SLAC 


1.33 


1.30 


1.31 


1.29 


BCDMS 


1.57 


1.44 


1.42 


1.44 


NMC 


1.71 


1.66 


1.67 


1.66 


NMC-pd 


1.73 


1.30 


1.28 


1.29 


ZEUS 


1.05 


1.02 


1.03 


1.03 


HI 


1.02 


0.99 


0.99 


1.00 


CHORUS 


1.38 


1.31 


1.32 


1.33 


FLH108 


1.65 


1.67 


1.67 


1.67 


NuTeV Dimuon 


0.65 


0.66 


0.65 


0.68 


ZEUS HERA-II 


1.53 


1.49 


1.49 


1.49 



Table 5: The % 2 per degree of freedom, both total and for individual experiments, for the various 
iterations of the NNPDF1.2 t fits, always with 1000 replicas; "iteO" denotes the starting NNPDF1.2 
fit. 

sigma of the overall uncertainty. The only exception is the triplet T3 in the valence region. 

These statements may be made more quantitative by looking at the distances between 
the various curves, in units of their standard deviations combined in quadrature (as defined 
in Appendix B of [4]). Distance equal to one means that two curves are within one sigma of 
each other, so the average distance between a random sample of curves should tend roughly 
to one in the limit of large samples. The distances between itel and iteO, ite2 and itel, and 
ite3 and ite2 are shown as a function of x in Fig. U] for the various PDFs considered, and in 
Table U] averaged over x (the average is performed by sampling the distance at ten values 
of x in the data region, see Ref. [7]). Again the convergence at itel is apparent: only the 
distance between itel and iteO is ever appreciably larger than unity. 

It is clear from this analysis that the PDFs which are most affected by the inclusion of 
normalization uncertainties in the fits are the singlet £ in the region 0.001 < x < 0.1 (where 
there is tension between NMC and HERA), and the triplet T3 and sea asymmetry As in 
the region 0.04 < x < 0.4 (due mainly to tension between BCDMS, NMC and CHORUS). 

In Table [5] we show the value of the x 2 P er degree of freedom in the reference fit iteO 
and in the various iterations of the to fit. We note the improvement in the fit quality 
from the better handling of normalization uncertainties, particularly in BCDMS, NMC and 
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9(x,Q 2 ) 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 




o * 1 1 1 1 1 1 1 o 1 — ■ — ■ — 

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.001 0.01 0.1 



Figure 4: Distance between pairs of PDFs in subsequent iterations of the to method. For the singlet 
and the gluon PDFs the distances are shown both at small x and at large x. 
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CHORUS. The table also shows that the x 2 does not improve after the first iteration, which 
provides further evidence for the convergence of the to method at the first iteration. 

Finally, the PDFs from the starting to = NNPDF1.2 fit and those of the final iteration 
(ite3) of the to method, together with their respective uncertainties, are compared in Fig. [5j 
As expected from the distance analysis the most important shifts in central values due to the 
inclusion of normalization uncertainties may be seen in the singlet, the triplet, and the sea 
asymmetry, though even these are generally small (ie around one-sigma). It is interesting 
to observe that these changes in central values of the triplet in the valence region, and of 
the sea asymmetry, are accompanied by a reduction in the overall uncertainty, due to the 
improved compatibility between the various datasets once normalization uncertainties are 
properly taken into account. 

8 Conclusions 

We have studied various methods for the inclusion of multiplicative normalization uncer- 
tainties in combined fits to multiple data sets, using both Hessian and Monte Carlo methods, 
specifically but not necessarily in view of applications to PDF determination. We reviewed 
how the simplest approach of including the normalization uncertainties in the covariance 
matrix leads to the well-known "d'Agostini bias" [9], and showed that the commonly used 
penalty trick method, while fine for the analysis of a single set of experimental data, can 
lead to biases when combining several independent data sets. We then developed a new 
technique, the £o _me thod, in which normalization uncertainties are introduced into the co- 
variance matrix in such a way that the results are unbiased. While this technique requires 
iteration to self-consistency, we showed that in practice the convergence is very fast, so 
that only one iteration is generally required. 

To demonstrate the practical application of the io _me thod, we implemented it in the 
most recent published parton fit by the NNPDF collaboration, NNPDF1.2 [7j. This con- 
firmed the rapid convergence of the technique, showed that the inclusion of normalization 
uncertainties can lead to a small improvement in the quality of the fit through the resolution 
of tensions between datasets, and moreover that where these tensions are significant this 
can lead to a subsequent reduction in PDF uncertainties. 

We note that the io- me th°d, while very well suited to the determination of PDF uncer- 
tainties by Monte Carlo methods, could also be used in the more traditional Hessian fitting 
methods, where it would lead to faster minimization (since in the to- me th°d the dataset 
normalizations are not fitted), and more reliable central values (since unlike the penalty 
trick the io- me thod is free from systematic bias). However, the Hessian estimate of uncer- 
tainties is still a little less reliable than that from the Monte Carlo method, since quadratic 
cross- variance terms (such as in Eq. (I19j) ) are always missing, and Gaussian distributions 
of PDF parameters are always implicitly assumed. 

The to- me th°d has now been used in the global NNPDF fits [15], where the compatibility 
of deep-inelastic and hadronic data is a relevant issue. As expected, here the inclusion of 
the normalization uncertainties results in a significant improvement in the quality of the fit 
to the hadronic data sets. 

Note that the d'Agostini and penalty trick biases discussed in this paper affect all mul- 
tiplicative errors, not only overall normalizations. Many of the systematic errors in cross- 
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Figure 5: Comparison between the PDFs obtained from the ite3 NNPDF1.2— to fit with the 
standard (to = 0) NNPDF1.2 [7] fit. The quantity plotted is the ratio of the difference between the 
NNPDF1.2-i and NNPDF1.2 results to the NNPDF1.2 itself. 
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section measurements are closer to multiplicative than additive (see for example Ref. |13j). 
The £o _me thod might thus be developed into a general technique to obtain bias free fits to 
data sets with a variety of multiplicative systematic uncertainties. 
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